[SPARK-4277] Support external shuffle service on Standalone Worker#3142
[SPARK-4277] Support external shuffle service on Standalone Worker#3142aarondav wants to merge 4 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
I realized that these checks were not useful (since the secret key is already doing the security checks). Additionally, it doesn't make sense on the Worker.
|
@pwendell Perhaps you could review this? |
|
Test build #23009 has started for PR 3142 at commit
|
|
Test build #23012 has started for PR 3142 at commit
|
|
Test build #23009 has finished for PR 3142 at commit
|
|
Test FAILed. |
|
Test build #23014 has started for PR 3142 at commit
|
|
Test build #23012 has finished for PR 3142 at commit
|
|
Test PASSed. |
|
Test build #23014 has finished for PR 3142 at commit
|
|
Test FAILed. |
|
retest this please |
There was a problem hiding this comment.
Should we have some node-level environment variable in addition to this? The point is that if the Worker is started on a different port then it would be good if the executors can just pick it up. Otherwise the executors have no way of figuring out which other port the service is started on unless they go check out the worker logs or something. In Yarn for instance we have shared Yarn config used by both the executors and the NM. Though this is probably fine for first-cut implementation. Any thoughts @pwendell
|
Test build #23022 has started for PR 3142 at commit
|
There was a problem hiding this comment.
I wouldn't be opposed to this being called StandaloneShuffleService now that you renamed the other one ExternalShuffleService. This name would match YarnShuffleService more. The existing name is also fine.
|
Addressed or responded to all comments, save the environment variable one. |
|
Test build #23027 has started for PR 3142 at commit
|
|
Test build #23022 has finished for PR 3142 at commit
|
|
Test PASSed. |
|
Test build #23027 has finished for PR 3142 at commit
|
|
Test PASSed. |
|
Ok, LGTM. Now the dynamic allocation feature can be used on standalone mode too. Super cool. I merge. |
Author: Aaron Davidson <aaron@databricks.com> Closes #3142 from aarondav/worker and squashes the following commits: 3780bd7 [Aaron Davidson] Address comments 2dcdfc1 [Aaron Davidson] Add private[worker] 47f49d3 [Aaron Davidson] NettyBlockTransferService shouldn't care about app ids (it's only b/t executors) 258417c [Aaron Davidson] [SPARK-4277] Support external shuffle service on executor (cherry picked from commit 6e9ef10) Signed-off-by: Andrew Or <andrew@databricks.com>
|
Hey Aaron - this LGTM and seems pretty straightforward. Just wondering, what is the behavior if it can't bind to that port? I'm thinking about the case where multiple workers are running on each machine. Will it crash the worker or just e.g. send a log message? If the latter, I think this would still work correctly... it would be a bit hackish, but one worker would grab the port first and then the others would not start a shuffle service. |
|
The current behavior would crash the second worker. We could disable this behavior and revert to logging if SPARK_WORKER_INSTANCES is set and > 1, but I would be hesitant to just always catch errors from starting the server, as it would just continuously produce failing executors. |
No description provided.